Contract.info Content
Contract ID
Project Name
Report ID animal_plant.revio
Report Date 2024-07-18

1 Sequencing Platform Introduction

With the development of high-throughput sequencing technology, third-generation sequencing platforms have become new tools for genome research. PacBio is one such platform; it supports longer library fragments and produces longer reads. Single-molecule real-time (SMRT) sequencing is a parallelized single-molecule DNA sequencing method built on the zero-mode waveguide (ZMW). A single DNA polymerase is fixed at the bottom of each ZMW with a single DNA molecule as its template. The ZMW creates an illuminated observation volume small enough to observe a single nucleotide being incorporated by the polymerase. Each of the four nucleotides carries one of four different fluorescent dyes. When a nucleotide is incorporated, the fluorescent tag is cleaved off and diffuses out of the observation volume, where its fluorescence is no longer observable. A detector records the fluorescent signal of each incorporation event, and the base is called from the fluorescence of the corresponding dye.

CCS (Circular Consensus Sequencing) builds a consensus sequence from multiple subreads of the same DNA molecule, giving highly accurate reads with accuracy above 99.9% (Q30). The PacBio Revio platform is used for sequencing. (For more information, please refer to the Appendix.)

Figure 1.1 PacBio CCS sequencing principle

2 Library Construction, Quality Control and Sequencing

After DNA quality control, the genomic DNA was fragmented to an appropriate size, and SMRTbell cleanup beads were added to each tube of sheared DNA for purification. The DNA fragments then underwent damage repair and end repair. The SMRTbell library was produced by ligating hairpin-shaped sequencing adapters to both ends of the DNA fragments, and fragments that failed to ligate were removed with exonucleases. After size selection (AMPure PB beads or gel cassette) and AMPure PB bead purification, the sequencing primer was annealed to the SMRTbell templates, followed by binding of the sequencing polymerase to the annealed templates. The DNA library preparation procedure is as follows:

Figure 2.1 Library construction workflow

3 Raw Data Processing and Data Statistics

PacBio highly accurate long reads, known as HiFi reads, are produced by sequencing a single molecule of DNA multiple times. HiFi reads can be used across a wide range of SMRT Sequencing applications, including whole genome sequencing for de novo assembly, comprehensive variant detection, RNA sequencing and more.

HiFi reads are produced by calling consensus from subreads generated by multiple passes of the enzyme around a circularized template. This results in a HiFi read that is both long and accurate.

Figure 3.1 The generation of HiFi reads

4 Bioinformatics Analysis Pipeline

Figure 4.1 PacBio HiFi data analysis pipeline

5 Standard Analysis

5.1 Sequencing Read QC

Table 5.1 Statistics of HiFi sequencing data.


  • Sample: Sample name.
  • Reads number: The number of sequencing reads.
  • Bases(bp): Number of sequencing nucleotides(bp).
  • Mean Length(bp): Average length of reads.
  • Longest(bp): Length of the longest read.
  • N10(bp): Sort the reads in descending order of length and accumulate their lengths one by one until the cumulative length is no less than 10% of the total length; the length of the last read added is the N10.
  • N50(bp): Sort the reads in descending order of length and accumulate their lengths one by one until the cumulative length is no less than 50% of the total length; the length of the last read added is the N50.
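
The N10 and N50 definitions above can be illustrated with a minimal Python sketch. It assumes the common convention of accumulating from the longest read downward; the function name `nx_length` and the toy read lengths are hypothetical and are not part of the report pipeline.

```python
from typing import Sequence


def nx_length(read_lengths: Sequence[int], fraction: float) -> int:
    """Return the Nx read length (fraction=0.5 gives N50, 0.1 gives N10)."""
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(read_lengths)
    running_total = 0
    # Walk from the longest read to the shortest, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return min(read_lengths)  # only reached through floating-point rounding


toy_lengths = [2, 3, 4, 5, 6]  # hypothetical read lengths in bp
n50 = nx_length(toy_lengths, 0.5)
n10 = nx_length(toy_lengths, 0.1)
```

With this convention N10 is never smaller than N50, because fewer of the longest reads are needed to reach 10% of the total than to reach 50%.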

5.1.1 HiFi Reads Statistics

The distribution of HiFi reads length is shown in the figure below:

Figure 5.1 Distribution of HiFi reads length.

  Sequencing read information is recorded in: Result/01.LRQC

5.2 Mapping

5.2.1 Statistics of Reference Genome

Table 6.1 Statistics of reference genome

  • Reference: Reference name.
  • Seq number: Total number of the assembled sequences or scaffolds.
  • Total length: Total length of assembled genomic sequence.
  • GC content(%): GC content of the reference genome.
  • Gap rate(%): The proportion of unknown nucleotides (N) in the reference genome assembly.
  • N50 length: Scaffold N50; scaffolds of this length or longer contain at least 50% of the total assembly length.
  • N90 length: Scaffold N90; scaffolds of this length or longer contain at least 90% of the total assembly length.
  • Total length NoN: Total length of assembled genomic sequence excluding N bases.

5.2.2 Mapping Results

The processed clean reads are mapped to the reference genome with minimap2. The BAM output of minimap2 is then sorted and merged with Samtools. The sorted BAM file is used to calculate mapping statistics; sequencing depth and coverage are summarized in the following table:

Table 6.2 Mapping rate, coverage and depth statistics

  • Sample: Sample names.
  • Clean Reads: The number of reads passed QC.
  • Mapped Reads: The number of reads mapped to the reference genome.
  • Mapping Rate: The ratio of bases mapped to the reference genome.
  • Mean Depth: Average sequencing depth.
  • 1X Coverage: The fraction of reference genome bases covered at a depth of at least 1X.
  • 4X Coverage: The fraction of reference genome bases covered at a depth of at least 4X.
  • 10X Coverage: The fraction of reference genome bases covered at a depth of at least 10X.
  • 20X Coverage: The fraction of reference genome bases covered at a depth of at least 20X.
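
The depth and coverage columns above can be illustrated with a short Python sketch. It assumes a per-base depth value is available for every reference position and that mean depth is averaged over all reference bases (the report does not state which denominator is used); `depth_summary` and the toy depth values are hypothetical.

```python
from typing import Dict, Sequence


def depth_summary(per_base_depth: Sequence[int],
                  thresholds: Sequence[int] = (1, 4, 10, 20)) -> Dict[str, float]:
    """Mean depth plus the fraction of bases covered at each depth threshold."""
    genome_length = len(per_base_depth)
    if genome_length == 0:
        raise ValueError("per_base_depth must not be empty")
    summary = {"mean_depth": sum(per_base_depth) / genome_length}
    for threshold in thresholds:
        # Count positions whose depth reaches the threshold.
        covered = sum(1 for depth in per_base_depth if depth >= threshold)
        summary[f"{threshold}X"] = covered / genome_length
    return summary


toy_depth = [0, 1, 4, 10, 20, 25, 3, 0, 12, 5]  # hypothetical 10-bp genome
stats = depth_summary(toy_depth)
```

For a real genome the depth values would come from the sorted BAM file rather than an in-memory list; the arithmetic is the same.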

5.2.3 Sequencing Depth & Coverage Distribution

Figure 6.1 Summary of mapping statistics of each chromosome.

The horizontal axis represents the chromosomes. The mean sequencing depth of each chromosome is indicated by the height of its bar (left vertical axis), while the fraction of covered bases on each chromosome is indicated by the scatter plot (right vertical axis).

  Results are stored in the following path: Result/02.Mapping

5.3 Structural Variations (SVs) Calling Results

Structural variations (SVs) are genomic variations of relatively large size (>50 bp), including deletions, duplications, insertions, inversions and translocations. SVs can be a source of individual differences and of disease susceptibility in different species. Detection of large genomic variations has proven challenging with short-read methods. Long-read approaches, such as sequencing on PacBio or Nanopore platforms, produce continuous reads that span large events and have shown promise in greatly expanding the ability to call structural variation. Novogene employs Sniffles for SV detection from PacBio HiFi/Nanopore data. Sniffles is a structural variation detection tool that balances accuracy and resolution well and therefore provides reliable SV detection results.

5.3.1 SV Statistics

Third-generation sequencing provides long reads with high accuracy that can span highly repetitive and complex regions. SV calling from third-generation data is therefore more sensitive and accurate than SV calling from next-generation (short-read) sequencing.

Table 7.1 Summary of SV statistics


  • Sample: Sample names.
  • Upstream: Variant overlaps 1-kb region upstream of transcription start site.
  • UTR5: Variant overlaps a 5' untranslated region.
  • UTR3: Variant overlaps a 3' untranslated region.
  • UTR5/UTR3: Variant overlaps both a 5' and a 3' untranslated region.
  • Exonic: Variant overlaps a coding region.
  • Unknowns: Variant with unknown function.
  • ncRNA: Variant overlaps a non-coding RNA.
  • Intronic: Variant overlaps intronic region.
  • Splicing: Variant overlaps intronic region and is at most 5 bp away from the exon-intron boundary.
  • Downstream: Variant overlaps 1-kb region downstream of transcription end site.
  • Upstream/Downstream: Variant overlaps one gene's upstream region and another gene's downstream region at the same time.
  • Intergenic: Variant overlaps intergenic region.
  • Others: Other types of SV.
  • INS: Insertion.
  • DEL: Deletion.
  • INV: Inversion.
  • BND: Breakend (breakpoint of a rearrangement such as a translocation).
  • DUP: Duplication.
  • Total: The total number of SVs.

5.3.2 Summary of SV Statistics

Figure 7.1 SV length distribution

The vertical axis represents the length of detected SVs, and the horizontal axis represents the ratio of SVs with certain length.

Figure 7.2 SV type overview.

The horizontal axis represents sample name, and the vertical axis represents the ratio of different types of SV.

Figure 7.3 Summary of SV locations.

This figure depicts the proportion of SVs located in each genomic region.

5.3.3 SV Annotation

Table 7.2 SV annotation.


The table above is only a preview of the full results, which can be found in the directory: Result/03.Variants/SV

Annotation notes:

1.Chr: chromosome number.

2.Start: variant start site.

3.End: variant end site.

4.Ref: sequence of reference genome, the N in reference genome is ignored.

5.Alt: altered sequence; for BND, DEL, INV and other SV types the value is 0.

6.GeneName: list of gene names involved in altered region.

7.Func: annotation of the region the variant overlaps (exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_intronic, ncRNA_UTR3, ncRNA_UTR5, ncRNA_splicing, upstream, downstream, intergenic). Note: 1) exonic includes coding, UTR3 and UTR5; 2) when a variant is located in a region with multiple functions, the annotation is chosen by the following order of precedence: exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic. "UTR5,UTR3" indicates a variant overlapping a region that is the UTR5 of one gene and the UTR3 of a second gene at the same time. "upstream, downstream" indicates a variant overlapping a region that is the upstream region of one gene and the downstream region of a second gene at the same time.

8.Gene: The transcript name(s). If a variant has 'intergenic' in 'Func' field, this field will give two neighboring transcripts. If a variant hits multiple transcripts with different functional categories, only transcript names in accordance with the value of 'Func' field will be output. For example, rs333970 hits the exonic, splicing, intronic, exonic of the four transcripts of gene CSF1, the 'Func' value will be 'exonic; splicing' and the 'Gene' value will be 'NM_000757, NM_172210, NM_172212' (NM_172211 will be ignored).

9.GeneDetail: description of variant impact on the transcript. Note: once the variant overlaps intergenic region, the value of "dist" represents the distance between variant and nearby gene.

10.ExonicFunc: functional effect of an SNV or InDel (SNV values include synonymous_SNV, missense_SNV, stopgain, stoploss and unknown; InDel values include frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion and unknown).

11.AAChange: amino acid change caused by variant.

12.Otherinfo1: genotype, encoded numerically. Homozygous: 0/0 corresponds to 0 and 1/1 corresponds to 1; heterozygous: 0/1 corresponds to 0.5.

13.Otherinfo2: QUAL in the VCF file, the Phred-scaled probability that the site has no variant, computed as Phred = -10 * log10(1-p), where p is the probability that the variant exists. The higher the value, the more likely the site is a variant.

14.Otherinfo3: '.' .

15.Otherinfo4: SV name detected by caller.

16.Otherinfo5: if BND, sequence of reference genome.

17.Otherinfo6: pos of reference genome.

18.Otherinfo7: SV description, INFO in the VCF file.

19.Otherinfo8: FORMAT in the VCF file; different parameters are separated by ":". GT: Genotype; 0: no alteration detected; 1, 2, 3 indicate that the detected allele differs from the reference allele. Homozygous: 0/0, 1/1; Heterozygous: 0/1. GQ: Conditional genotype quality. DR: High-quality reference reads. DV: High-quality variant reads.

20.Otherinfo9: The FORMAT value corresponding to Otherinfo8.
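
The relationship between Otherinfo8 (the FORMAT keys) and Otherinfo9 (the matching values) can be illustrated with a minimal Python sketch. The helper names and the example record are hypothetical, and the sketch assumes both fields contain the same number of ":"-separated entries.

```python
from typing import Dict


def parse_format(format_keys: str, sample_values: str) -> Dict[str, str]:
    """Pair each FORMAT key (Otherinfo8) with its value (Otherinfo9)."""
    keys = format_keys.split(":")
    values = sample_values.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT keys and sample values differ in length")
    return dict(zip(keys, values))


def zygosity(genotype: str) -> str:
    """Classify a diploid GT string such as 0/0, 0/1 or 1/1."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"


# Hypothetical SV record: heterozygous, 12 reference reads, 9 variant reads.
record = parse_format("GT:GQ:DR:DV", "0/1:35:12:9")
call = zygosity(record["GT"])
variant_read_fraction = int(record["DV"]) / (int(record["DR"]) + int(record["DV"]))
```

The fraction of variant-supporting reads (DV over DR plus DV) is a convenient sanity check against the reported genotype.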

6 Advanced Analysis

6.1 SNP Detection & Annotation

A single-nucleotide polymorphism (SNP) is a germline variation of a single nucleotide at a specific position in the genome. SNPs are one of the most common forms of genetic variation. Novogene employs DeepVariant, an open-source tool developed by Google, to detect SNPs. DeepVariant uses a deep neural network to call genetic variants from third-generation sequencing data. Compared with other variant detection tools, DeepVariant can process data generated on different sequencing platforms with shorter running time and high accuracy. The sorted BAM file was analyzed by DeepVariant for SNP detection. The output of DeepVariant is saved in VCF format (https://samtools.github.io/hts-specs/VCFv4.2.pdf) and further annotated by ANNOVAR.

6.1.1 SNP Calling Results

Table 8.1 SNP statistics

  • Sample: Sample names.
  • Upstream: Variant overlaps 1-kb region upstream of transcription start site.
  • UTR5: Variant overlaps a 5' untranslated region.
  • UTR3: Variant overlaps a 3' untranslated region.
  • UTR5;UTR3: Variant overlaps both a 5' and a 3' untranslated region.
  • Exonic: Variant overlaps a coding region.
      • Stop gain: Variant that leads to the immediate creation of a stop codon at the variant site.
      • Stop loss: Variant that leads to the immediate elimination of a stop codon at the variant site.
      • Non-synonymous: Variant that causes an amino acid change.
      • Synonymous: Variant that does not cause an amino acid change.
      • Unknowns: Variant with unknown function.
  • ncRNA: Variant overlaps a non-coding RNA.
  • Intronic: Variant overlaps intronic region.
  • Splicing: Variant overlaps intronic region and is at most 5 bp away from the exon-intron boundary.
  • Downstream: Variant overlaps 1-kb region downstream of transcription end site.
  • Upstream/Downstream: Variant overlaps one gene's upstream region and another gene's downstream region at the same time.
  • Intergenic: Variant overlaps intergenic region.
  • Others: Other types of variant.
  • ts: Transition mutation.
  • tv: Transversion mutation.
  • ts/tv: The ratio of ts versus tv.
  • Het rate: SNP heterozygosity rate, calculated as the number of heterozygous SNPs divided by the whole genome size.
  • Total: The total number of SNPs.
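
The ts, tv, ts/tv and Het rate columns above can be illustrated with a small Python sketch. The function names and the toy SNP list are hypothetical; a transition is a purine-to-purine or pyrimidine-to-pyrimidine substitution, and every other substitution is a transversion.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    if tv == 0:
        raise ValueError("no transversions, ratio is undefined")
    return ts / tv


def het_rate(heterozygous_snps: int, genome_size: int) -> float:
    """Heterozygous SNP count divided by genome size."""
    return heterozygous_snps / genome_size


toy_snps = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C"), ("G", "T"), ("G", "A")]
ratio = ts_tv_ratio(toy_snps)  # 4 transitions, 2 transversions
```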

6.1.2 Statistics of SNP Classification

SNPs can be classified into six types. For instance, the type T:A>C:G covers two substitutions, T>C and A>G, because a T>C substitution on the forward strand is equivalent to an A>G substitution on the reverse strand.
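
The collapse of the twelve possible substitutions into six strand-independent types can be sketched in Python as follows. The labels follow the T:A>C:G notation used above; choosing the pyrimidine (C or T) reference as the canonical strand is an assumption of this sketch, and `snp_class` is a hypothetical helper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def snp_class(ref: str, alt: str) -> str:
    """Map a substitution to one of six strand-independent classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("ref and alt must be two different bases from ACGT")
    if ref in "AG":
        # Purine reference: describe the same event on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


# Enumerate all twelve substitutions and collect the distinct classes.
all_classes = {
    snp_class(ref, alt)
    for ref in "ACGT"
    for alt in "ACGT"
    if ref != alt
}
```

Here `snp_class("T", "C")` and `snp_class("A", "G")` both return the label `T:A>C:G`, and `all_classes` contains six entries.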

Figure 8.1 SNP type statistics.

The horizontal axis represents SNP number, while the vertical axis represents SNP type.

Figure 8.2 Ratio of different SNP type in different samples.

The horizontal axis represents sample name, while the vertical axis represents ratio of different types of SNP.

Figure 8.3 Summary of variants location.

This figure depicts the ratio of different SNPs located in different genomic regions.

6.1.3 Results of SNP Annotation

Table 8.2 Results of SNP Annotation.

The table above is only a preview of the full results, which can be found in the directory: Result/03.Variants/SNP

Annotation notes:

1.Chr: chromosome number.

2.Start: variant start site.

3.End: variant end site.

4.Ref: sequence of reference genome, the N in reference genome is ignored.

5.Alt: altered sequence; for BND, DEL, INV and other SV types the value is 0.

6.GeneName: list of gene names involved in altered region.

7.Func: annotation of the region the variant overlaps (exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_intronic, ncRNA_UTR3, ncRNA_UTR5, ncRNA_splicing, upstream, downstream, intergenic). Note: 1) exonic includes coding, UTR3 and UTR5; 2) when a variant is located in a region with multiple functions, the annotation is chosen by the following order of precedence: exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic. "UTR5,UTR3" indicates a variant overlapping a region that is the UTR5 of one gene and the UTR3 of a second gene at the same time. "upstream, downstream" indicates a variant overlapping a region that is the upstream region of one gene and the downstream region of a second gene at the same time.

8.Gene: The transcript name(s). If a variant has 'intergenic' in 'Func' field, this field will give two neighboring transcripts. If a variant hits multiple transcripts with different functional categories, only transcript names in accordance with the value of 'Func' field will be output. For example, rs333970 hits the exonic, splicing, intronic, exonic of the four transcripts of gene CSF1, the 'Func' value will be 'exonic; splicing' and the 'Gene' value will be 'NM_000757, NM_172210, NM_172212' (NM_172211 will be ignored).

9.GeneDetail: description of variant impact on the transcript. Note: once the variant overlaps intergenic region, the value of "dist" represents the distance between variant and nearby gene.

10.ExonicFunc: functional effect of an SNV or InDel (SNV values include synonymous_SNV, missense_SNV, stopgain, stoploss and unknown; InDel values include frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion and unknown).

11.AAChange: amino acid change caused by variant.

12.Otherinfo1: genotype, encoded numerically. Homozygous: 0/0 corresponds to 0 and 1/1 corresponds to 1; heterozygous: 0/1 corresponds to 0.5.

13.Otherinfo2: QUAL in the VCF file, the Phred-scaled probability that the site has no variant, computed as Phred = -10 * log10(1-p), where p is the probability that the variant exists. The higher the value, the more likely the site is a variant.

14.Otherinfo3: sequencing depth of the variant.

15.Otherinfo4: chromosome number.

16.Otherinfo5: variant start site.

17.Otherinfo6: variant end site.

18.Otherinfo7: reference genomic base type.

19.Otherinfo8: sample genome base type.

20.Otherinfo9: FILTER in the VCF file. Data in the VCF file is generally filtered before being used as a variant callset (data set of variant sites). There are three possible values: the names of the filters that the site failed; PASS, indicating that the site passed all filters; or a value indicating that no filtering was applied at this site.

21.Otherinfo10: INFO in the VCF file.

22.Otherinfo11: FORMAT in the VCF file; different parameters are separated by ":". GT: Genotype; 0: no alteration detected; 1, 2, 3 indicate that the detected allele differs from the reference allele. Homozygous: 0/0, 1/1; Heterozygous: 0/1. PL: Normalized Phred-scaled likelihoods. DP: Read depth. AD: Read depth for reference and variant allele. GQ: Conditional genotype quality. VAF: Variant allele fractions.

23.Otherinfo12: The FORMAT value corresponding to Otherinfo11.
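
The precedence rule described for the Func field (note 7) can be sketched in Python. This is a simplified illustration, not ANNOVAR's implementation: the ranking table collapses all ncRNA categories into a single hypothetical "ncRNA" label, and the output joins tied categories with ";".

```python
from typing import Sequence

# Lower rank means higher precedence; categories with equal rank are tied.
FUNC_RANK = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def top_func(transcript_hits: Sequence[str]) -> str:
    """Keep only the highest-precedence categories among the transcript hits."""
    if not transcript_hits:
        raise ValueError("transcript_hits must not be empty")
    best_rank = min(FUNC_RANK[hit] for hit in transcript_hits)
    kept = []
    for hit in transcript_hits:
        # Preserve first-seen order and drop duplicates.
        if FUNC_RANK[hit] == best_rank and hit not in kept:
            kept.append(hit)
    return ";".join(kept)


# Mirrors the CSF1 example in note 8: four transcripts, one of them intronic.
csf1_func = top_func(["exonic", "splicing", "intronic", "exonic"])
```

In the CSF1 example the intronic transcript is dropped because exonic and splicing outrank it, which is why only three transcript names remain in the Gene field.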

6.2 InDel Calling & Annotation

An InDel is a short polymorphism corresponding to an insertion or deletion in the genome sequence. Its length is normally shorter than 50 bp, and most InDels are shorter than 10 bp. An InDel may lie in coding or non-coding DNA and may change the resulting protein sequence or transcriptional activity. Novogene employs DeepVariant to call InDels and ANNOVAR to annotate the calling results.
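
As an illustration of the definition above, the following Python sketch classifies a variant from its REF and ALT alleles as an insertion or deletion, reports its size, and labels it frameshift or nonframeshift. The function name is hypothetical, the frameshift label is only meaningful for InDels inside a coding region, and the 50 bp cutoff is taken from the text as a rule of thumb rather than a fixed pipeline parameter.

```python
from typing import Tuple

MAX_INDEL_BP = 50  # size cutoff quoted in the text; larger events are treated as SVs


def classify_indel(ref: str, alt: str) -> Tuple[str, int, str]:
    """Return (type, size in bp, frame effect) for a REF/ALT allele pair."""
    difference = len(alt) - len(ref)
    if difference == 0:
        raise ValueError("alleles have equal length, not an InDel")
    kind = "insertion" if difference > 0 else "deletion"
    size = abs(difference)
    # A length change that is a multiple of 3 keeps the reading frame.
    frame = "nonframeshift" if size % 3 == 0 else "frameshift"
    return kind, size, frame


def is_small_indel(size: int) -> bool:
    """True if the event is below the InDel size cutoff."""
    return size < MAX_INDEL_BP


insertion = classify_indel("A", "ATGC")
deletion = classify_indel("ATG", "A")
```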

6.2.1 InDel Calling Statistics

Table 8.3 InDel statistics.

The table above is only a preview of the full results, which can be found in the directory: Result/03.Variants/Indel

  • Sample: Sample names.
  • Upstream: Variant overlaps 1-kb region upstream of transcription start site.
  • UTR5: Variant overlaps a 5' untranslated region.
  • UTR3: Variant overlaps a 3' untranslated region.
  • UTR5;UTR3: Variant overlaps both a 5' and a 3' untranslated region.
  • Exonic: Variant overlaps a coding region.
      • Stop gain: Variant that leads to the immediate creation of a stop codon at the variant site.
      • Stop loss: Variant that leads to the immediate elimination of a stop codon at the variant site.
      • Non-synonymous: Variant that causes an amino acid change.
      • Synonymous: Variant that does not cause an amino acid change.
      • Unknowns: Variant with unknown function.
  • Intronic: Variant overlaps intronic region.
  • Splicing: Variant overlaps intronic region and is at most 5 bp away from the exon-intron boundary.
  • Downstream: Variant overlaps 1-kb region downstream of transcription end site.
  • Upstream/Downstream: Variant overlaps one gene's upstream region and another gene's downstream region at the same time.
  • Intergenic: Variant overlaps intergenic region.
  • Insertion: Variant type.
  • Deletion: Variant type.
  • Het rate: InDel heterozygosity rate, calculated as the number of heterozygous InDels divided by the total number of genome bases.
  • Total: The total number of InDels.

6.2.2 Length Distribution of InDel

The length distribution of InDels for the samples in the sample list is shown in the figure below:

Figure 8.4 Length distribution of InDels.

The x-axis represents the proportion of the InDels with a certain length, and y-axis indicates the length of the InDels.

Figure 8.5 The number of InDels in different regions of the genome.

The left panel shows the number of InDels in different regions of the genome, and the right panel shows the number of InDels of each type in the coding region.

6.2.3 Results of InDel Annotation

Table 8.4 InDels with Annotation.

Annotation notes:

1.Chr: chromosome number.

2.Start: variant start site.

3.End: variant end site.

4.Ref: sequence of reference genome, the N in reference genome is ignored.

5.Alt: altered sequence; for BND, DEL, INV and other SV types the value is 0.

6.GeneName: list of gene names involved in altered region.

7.Func: annotation of the region the variant overlaps (exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_intronic, ncRNA_UTR3, ncRNA_UTR5, ncRNA_splicing, upstream, downstream, intergenic). Note: 1) exonic includes coding, UTR3 and UTR5; 2) when a variant is located in a region with multiple functions, the annotation is chosen by the following order of precedence: exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic. "UTR5,UTR3" indicates a variant overlapping a region that is the UTR5 of one gene and the UTR3 of a second gene at the same time. "upstream, downstream" indicates a variant overlapping a region that is the upstream region of one gene and the downstream region of a second gene at the same time.

8.Gene: The transcript name(s). If a variant has 'intergenic' in 'Func' field, this field will give two neighboring transcripts. If a variant hits multiple transcripts with different functional categories, only transcript names in accordance with the value of 'Func' field will be output. For example, rs333970 hits the exonic, splicing, intronic, exonic of the four transcripts of gene CSF1, the 'Func' value will be 'exonic; splicing' and the 'Gene' value will be 'NM_000757, NM_172210, NM_172212' (NM_172211 will be ignored).

9.GeneDetail: description of variant impact on the transcript. Note: once the variant overlaps intergenic region, the value of "dist" represents the distance between variant and nearby gene.

10.ExonicFunc: functional effect of an SNV or InDel (SNV values include synonymous_SNV, missense_SNV, stopgain, stoploss and unknown; InDel values include frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion and unknown).

11.AAChange: amino acid change caused by variant.

12.Otherinfo1: genotype, encoded numerically. Homozygous: 0/0 corresponds to 0 and 1/1 corresponds to 1; heterozygous: 0/1 corresponds to 0.5.

13.Otherinfo2: QUAL in the VCF file, a Phred-scaled quality value indicating how likely the site is to be a variant; the higher the value, the more likely the site is a variant. It is computed as Phred = -10 * log10(1-p), where p is the probability that the variant exists. For example, a value of 10 corresponds to an error probability of 0.1, i.e. a 90% probability that the site is a variant.

14.Otherinfo3: sequencing depth of the variant.

15.Otherinfo4: chromosome number.

16.Otherinfo5: variant start site.

17.Otherinfo6: variant end site.

18.Otherinfo7: reference genomic base type.

19.Otherinfo8: sample genome base type.

20.Otherinfo9: FILTER in the VCF file. Data in the VCF file is generally filtered before being used as a variant callset (data set of variant sites). There are three possible values: the names of the filters that the site failed; PASS, indicating that the site passed all filters; or a value indicating that no filtering was applied at this site.

21.Otherinfo10: INFO in the VCF file.

22.Otherinfo11: FORMAT in the VCF file; different parameters are separated by ":". GT: Genotype; 0: no alteration detected; 1, 2, 3 indicate that the detected allele differs from the reference allele. Homozygous: 0/0, 1/1; Heterozygous: 0/1. PL: Normalized Phred-scaled likelihoods. DP: Read depth. AD: Read depth for reference and variant allele. GQ: Conditional genotype quality. VAF: Variant allele fractions.

23.Otherinfo12: The FORMAT value corresponding to Otherinfo11.
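
The Phred-scaled QUAL value described under Otherinfo2 can be converted to and from a probability with a few lines of Python. This sketch uses a base-10 logarithm, as in the VCF specification; the function names are hypothetical.

```python
import math


def qual_from_probability(p_variant: float) -> float:
    """Phred-scaled quality: -10 * log10(1 - p), with p the variant probability."""
    if not 0 <= p_variant < 1:
        raise ValueError("p_variant must be in [0, 1)")
    return -10 * math.log10(1 - p_variant)


def probability_from_qual(qual: float) -> float:
    """Inverse conversion: probability that the variant exists."""
    return 1 - 10 ** (-qual / 10)


q10 = qual_from_probability(0.9)   # error probability 0.1
p30 = probability_from_qual(30)    # error probability 0.001
```

Each additional 10 units of QUAL reduces the error probability by a factor of ten, so QUAL 10, 20 and 30 correspond to variant probabilities of 90%, 99% and 99.9%.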

6.3 Results of CNV Calling

Copy number variation (CNV) refers to a circumstance in which the number of copies of a specific segment of DNA varies among different individuals' genomes. The individual variants may be short or include thousands of bases. These structural differences may have come about through duplications, deletions or other changes and can affect long stretches of DNA. Such regions may or may not contain one or more genes. Novogene uses CNVkit to call genome-wide CNVs. CNVkit implements a pipeline for CNV detection that takes advantage of both on-target and off-target sequencing reads and applies a series of corrections to improve the accuracy of copy number calling.

Table 9.1 Statistics of CNV calling result

  • Sample: Sample name.
  • DEL: Number of deletions.
  • DUP: Number of duplications.
  • Total_CNV_nums: Total CNV count.
  • Total_CNV_len: Total CNV length.
  • mean_CNV_len: Average CNV length.
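
The columns of the CNV statistics table can be reproduced from a list of CNV segments with a short Python sketch. The tuple layout (chromosome, start, end, type), the half-open coordinate convention (length = end - start) and the toy segments are assumptions of this sketch, not the pipeline's documented format.

```python
from typing import Dict, Iterable, Tuple

Segment = Tuple[str, int, int, str]  # (chromosome, start, end, "DEL" or "DUP")


def cnv_summary(segments: Iterable[Segment]) -> Dict[str, float]:
    """Count deletions and duplications and summarize CNV lengths."""
    segments = list(segments)
    lengths = [end - start for _, start, end, _ in segments]
    total_count = len(segments)
    total_length = sum(lengths)
    return {
        "DEL": sum(1 for segment in segments if segment[3] == "DEL"),
        "DUP": sum(1 for segment in segments if segment[3] == "DUP"),
        "Total_CNV_nums": total_count,
        "Total_CNV_len": total_length,
        # Guard against division by zero when no CNV was called.
        "mean_CNV_len": total_length / total_count if total_count else 0.0,
    }


toy_segments = [
    ("chr1", 1000, 6000, "DEL"),
    ("chr1", 20000, 50000, "DUP"),
    ("chr2", 0, 10000, "DEL"),
]
cnv_stats = cnv_summary(toy_segments)
```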

Results recorded in the following path: Result/03.Variants/CNV

6.4 Visualization of Variants by Circos Plot

The plot below is generated by Circos and visualizes the distribution of variants (SNP, InDel, SV) across the whole genome.

Figure 10.1 Mutation map of the whole genome.

The outermost circle shows the position coordinates of the genome sequence. From outside to inside, the tracks show the SNP density, the InDel density (omitted if SNPs and InDels have not been analyzed), and the density of each structural variation (SV) type in the following order: insertion (INS), deletion (DEL), inversion (INV), duplication (DUP) and translocation (BND).

The figure below shows the densities of SNP and InDel on different chromosomes.

Figure 10.2 Heatmap of SNP/InDel densities of all chromosomes, 100kb window.
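
The per-window densities behind such a heatmap can be computed by binning variant positions into fixed 100 kb windows. The Python sketch below assumes 1-based variant positions and non-overlapping windows; `window_density` and the toy positions are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def window_density(positions: Iterable[Tuple[str, int]],
                   window_size: int = 100_000) -> Dict[Tuple[str, int], int]:
    """Count variants per (chromosome, window index) using 1-based positions."""
    counts: Counter = Counter()
    for chromosome, position in positions:
        # Positions 1..window_size fall in window 0, the next block in window 1.
        counts[(chromosome, (position - 1) // window_size)] += 1
    return dict(counts)


toy_positions = [("chr1", 1), ("chr1", 100_000), ("chr1", 100_001), ("chr2", 250_000)]
density = window_density(toy_positions)
```

Dividing each count by the window size would give a per-base density; the raw counts are usually sufficient for a heatmap because all windows have the same width.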

Results are stored in the following path: Result/03.Variants/Circos

7 References

[1] Pacific Biosciences of California, Inc. Template Preparation and Sequencing Guide. © 2010-2014. (PacBio)

[2] Rhoads A, Au K F. PacBio Sequencing and Its Applications[J]. Genomics, Proteomics & Bioinformatics, 2015, 13: 278-289. (PacBio)

[3] Pacific Biosciences of California, Inc. SMRT Tools Reference Guide. © 2015-2018. (PacBio)

[4] Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation[J]. Nucleic Acids Research, 2001, 29(1): 308-311. (dbSNP)

[5] Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data[J]. Nucleic Acids Research, 2010, 38(16): e164. (ANNOVAR)

[6] Sedlazeck F J, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing[J]. Nature Methods, 2018, 15(6): 461-468. (Sniffles)

[7] Poplin R, Chang P C, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks[J]. Nature Biotechnology, 2018, 36: 983-987. (DeepVariant)

[8] Talevich E, Shain A H, Botton T, et al. CNVkit: Genome-wide copy number detection and visualization from targeted DNA sequencing[J]. PLoS Computational Biology, 2016, 12(4): e1004873. (CNVkit)

8 Appendix

8.1 PacBio Platform Advantage

  1. Long HiFi Read Length: Average read length is 10-20 kb, so reads easily span highly repetitive and low-complexity regions;
  2. Great Capacity (Revio Platform): Redesigned SMRT Cells contain twenty-five million zero-mode waveguides (ZMWs);
  3. No GC Bias: With the longer reads, the read set will span GC-rich and GC-poor regions;
  4. No PCR Amplification Bias: PCR amplification is not necessary during library construction;
  5. Simultaneous Epigenetic Characterization: DNA modifications are determined directly from polymerase kinetics.

9 Glossary

9.1 SMRT® Cell

Consumable substrates comprising arrays of zero-mode waveguide (ZMW) nanostructures. SMRT Cells are used in conjunction with the DNA Sequencing Kit for on-instrument DNA sequencing.

9.2 Collection

The set of data collected during real-time observation of the SMRT® Cell, including spectral information and temporal information used to determine a read.

9.3 Zero-mode Waveguide (ZMW)

A nanophotonic device for confining light to a small observation volume. This can be, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection. Physically part of a SMRT® Cell.

9.4 Subreads

Each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. Subreads are useful for applications such as de novo assembly, resequencing, base modification analysis, and so on.

9.6 Note

It is recommended to open the result files with a spreadsheet program such as Excel or a text editor such as Emacs.